Enhancing Text Document Clustering Using Non-negative Matrix Factorization and WordNet
Authors
Abstract
A classic document clustering technique may incorrectly classify documents into different clusters when documents that should belong to the same cluster do not have any shared terms. Recently, to overcome this problem, internal and external knowledge-based approaches have been used for text document clustering. However, the clustering results of these approaches are influenced by the inherent structure and the topical composition of the documents. Further, the organization of knowledge into an ontology is expensive. In this paper, we propose a new enhanced text document clustering method using non-negative matrix factorization (NMF) and WordNet. The semantic terms extracted as cluster labels by NMF can represent the inherent structure of a document cluster well. The proposed method can also improve the quality of document clustering that uses cluster labels and term weights based on term mutual information of WordNet. The experimental results demonstrate that the proposed method achieves better performance than the other text clustering methods.
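As a rough illustration of the NMF step mentioned in the abstract, the sketch below factors a tiny term-document matrix with the standard multiplicative update rules and reads cluster assignments and candidate label terms off the factors. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy matrix, the vocabulary, and the names `nmf`, `doc_clusters`, and `cluster_labels` are hypothetical, and the WordNet-based term weighting is not included.

```python
import numpy as np

def nmf(A, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix A (terms x docs) as W @ H, with W, H >= 0.

    Uses the standard multiplicative updates for the Frobenius-norm objective.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = A.shape
    W = rng.random((n_terms, k)) + eps   # term-by-component weights
    H = rng.random((k, n_docs)) + eps    # component-by-document weights
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["apple", "fruit", "car", "engine"]
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

W, H = nmf(A, k=2)

# Assign each document to the component with the largest weight in H.
doc_clusters = H.argmax(axis=0)

# Take the two highest-weighted terms of each component as candidate labels.
cluster_labels = [
    [terms[i] for i in np.argsort(-W[:, j])[:2]]
    for j in range(W.shape[1])
]

print(doc_clusters)
print(cluster_labels)
```

On this toy matrix the first two documents share no terms with the last two, so the two components are expected to separate them; which component index each group receives depends on the random initialization.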
Similar resources
Document Clustering Using Term Weights and Class Label Terms Based on Semantic Features
Class labels for clustering can be generated automatically, but they are of much lower quality than labels specified by humans. In this paper, we propose a new enhanced document clustering method using class label terms and term weights. The class label terms, obtained by non-negative matrix factorization (NMF), can represent the inherent structure of document clusters well. It can also improve the quality ...
Text Clustering using Semantic Terms
In traditional text clustering, documents are represented by term frequencies without considering the semantic information of each document (i.e., the vector model). Because of this property, the vector model may incorrectly classify documents into different clusters when documents of the same cluster lack shared terms. Recently, knowledge-based approaches have been used to overcome this problem. However, these approaches have an...
Big Text Data Clustering using Class Labels and Semantic Feature Based on Hadoop of Cloud Computing
Class labels for clustering can be generated automatically, but they are of much lower quality than labels specified by humans. If the class labels for clustering are provided, the clustering is more effective. In classic document clustering based on the vector model, documents are represented by term frequencies without considering the semantic information of each document. The property of the vector model may be incorrect...
A Novel Fast Non-negative Matrix Factorization Algorithm and Its Application in Text Clustering
In non-negative matrix factorization, it is difficult to find the optimal non-negative factor matrix in each iterative update. However, with the help of a transformation matrix, it is possible to derive the optimal non-negative factor matrix for the transformed cost function. A transformation-matrix-based non-negative matrix factorization method is proposed and analyzed. It shows that this new method, w...
Document clustering using nonnegative matrix factorization
A methodology for automatically identifying and clustering semantic features or topics in a heterogeneous text collection is presented. Textual data is encoded using a low rank nonnegative matrix factorization algorithm to retain natural data nonnegativity, thereby eliminating the need to use subtractive basis vector and encoding calculations present in other techniques such as principal compone...
Journal title:
- J. Inform. and Commun. Convergence Engineering
Volume 11, Issue
Pages -
Publication date 2013